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Abstract. It is a classical result of Stein and Waterman that the asymptotic number of RNA 
secondary structures is 1.104366 ■ n~^^'^ ■ 2.618034". To provide a better understanding of the 
kinetics of RNA secondary structure formation, we are interested in determining the asymptotic 
number of secondary structures that are locally optimal, with respect to a particular energy 
model. In the Nussinov energy model, where each base pair contributes —1 towards the energy 
of the structure, locally optimal structures are exactly the saturated structures, for which we 
have previously shown that asymptotically, there are 1.07427 ■n.-3/2 -2.35467" many saturated 
structures for a sequence of length n. In this paper, we consider the base stacking energy 
model, a mild variant of the Nussinov model, where each stacked base pair contributes — 1 
toward the energy of the structure. Locally optimal structures with respect to the base stacking 
energy model are exactly those secondary structures, whose stems cannot be extended. Such 
structures were first considered by Evers and Giegerich, who described a dynamic programming 
algorithm to enumerate all locally optimal structures. In this paper, we apply methods from 
enumerative combinatorics to compute the asymptotic number of such structures. Additionally, 
wc consider analogous combinatorial problems for secondary structures with annotated single- 
stranded, stacking nucleotides (dangles). 



1. Introduction 

Historically, the development of combinatorics for RNA secondary structures [41] |46] has been 
intimately related to the development of algorithms for RNA minimum free energy (MFE) sec- 
ondary structure |52| I50| I19| . In particular, counting the number of secondary structures for 
sequence of length n is essentially equivalent to computing the Boltzmann partition function, de- 
fined hy Z = ^gexp{—E{S)/RT), where the sum is taken over all secondary structures S, the 
energy of S is denoted by E{S), R « 1.959 cal/mol is the universal gas constant, and T absolute 
tempcratureQ 

Complex analysis is used to obtain the asymptotic enumeration results described in this article 
and related articles mentioned in the introduction. In particular, given a complex generating 
function f(z) = ^a^^"", it is well-known from introductory complex analysis that / converges in 
a circular region about the point of expansion out to the dominant, or nearest, singularity r, and 
thus the asymptotic order of magnitude of a„ is approximately r^". Darboux's theorerro [34l [TS] 
states that if f(z) = X]^^o(^ ~ z)°'L{z), where r > 0, a is not a positive integer, and L is analytic 
in a disk of radius greater than r, then ~ r"~"?t"~^ r^-i) ' ^^^^ result was generalized by 
Bender [2], corrected by Mcir and Moon [32], and further extended by Flajolet and Odlyzko [13] 
and by Drmota, Lallcy and Woods (each of the latter worked independently) - sec the exposition 
in |14| for discussion and references. 

In [41j . Stein and Waterman proved that the asymptotic number of secondary structures is 
1.104366-n^3/2. 2,618034". Since that time, a number of additional results on the combinatorics of 
RNA structures have been obtained. In [22], Hofacker et al. derived a number of asymptotic results 
on the number of structures, expected number of base pairs, etc. for RNA secondary structures. 
Observing a correspondence with involutions, Haslinger and Stadler [17| provided an upper bound 
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on the number of bi-secondary structures, i.e. structures having non-nested pseudoknots that 
can be presented as a union = 5i U S'2 of disjoint secondary structures, and Rodland [3S] 
studied the asymptotic number of a number of classes of pseudoknotted structures. Building 
on a remarkable and pioneering paper of Harer and Zagier [16], Vernizzi et al. [43] classified 
pseudoknotted RNA structures according to topological genus g, and then applied the work of 
Harer and Zagier to obtain recurrence relations for the number of pseudoknotted structures of 
genus g. In [57] j Saule et al. provided a summary table of the asymptotic number of pseudoknotted 
structures structures, with respect to various allowed pseudoknots, and established the asymptotic 
number of pseudoknotted structures, with no restriction. In [25], Li and Reidys determined 
the asymptotic number of hybridizations of two interacting RNA structures. Moving away from 
counting the number of structures, Yoffc et al. [IS] and Clote et al. [B] determined the asymptotic 
expected distance between the 5' and 3' ends of RNA sequence, where the 5' to 3' distance of 
a given structure S on sequence si,...,s„ is defined as the minimum number of backbone or 
base-pairing edges in a minimum length path from si to s„. 

In [5], Clote computed the asymptotic number 1.07427 • n"'^/^ • 2.35467" of sateroted structures, 
defined by Zuker [49] as those for which no base pair can be added without violating the definition 
of secondary structure. In [8], Clote et al. provided another proof for the asymptotic number 
of saturated structures, which additionally yielded the asymptotic expected number of base pairs 
0.337361 • n for saturated structures. An overview of methods for RNA enumerative combinatorics 
is given in Lorenz et al. |26| , where additionally it is shown that the asymptotic number of shapes 
of secondary structures for a length n sequence is 2.44251 • n~^^'^ ■ 1.32218" ^ In [55] Hofacker et al. 
showed that the asymptotic number of canonical secondary structures (those having no isolated 
base pair) is 2.1614 • n-^/^ . 1.96798", a result that was confirmed by a different method in Clote 
et al. [5], where additionally the expected number of base pairs was shown to be 0.31724 • n. 

A locally optimal, or kinetically trapped, secondary structure S is one for which no secondary 
structure T, obtained from S by the removal or addition of a single base pair, has lower energy. 
It follows that saturated structures are exactly the kinetically trapped structures with respect to 
the Nussinov energy model [33j , in which each base pair receives a stabilizing energy contribution 
of —1. In this paper, we consider the base stacking energy model, in which each stacked base pair 
receives a stabilizing energy contribution of —1. Here, the base pair («, j) in secondary structure 
S is defined to be a stacked base pair, provided that (i — 1, j 4- 1) is also a base pair in 5 - i.e. 
provided that there is an outer base pair that provides a stabilizing stacking energy. In jl2j , Evers 
and Giegerich describe a dynamic programming algorithm to enumerate all structures that are 
locally optimal with respect to the base stacking energy model; i.e. those structures in which no 
stem can be extended. The authors called such structures "saturated" . When a strictly positive 
minimal value is specified for the length of every stem, a structure is saturated in the sense of 
Zuker [49] if and only if it is saturated in the sense of Evers and Giegerich [12]. However, as 
mentioned in [5], when the lengths of stems are not constrained, there are structures that are 
saturated in the sense of Evers and Giegerich [T5], but which are not saturated in the sense of 
Zuker [iS] . For clarity of exposition, we will call a secondary structure G-saturated if no stem can 
be extended. In this paper, we give an enumerative framework based on weighted plane trees that 
allows us to enumerate G-saturated structures (as well as recover the enumeration of secondary 
structures and of saturated structures) . We also consider analogous problems for structures with 
annotated single-stranded, stacked nucleotides (also called dangles). 

Outline of paper. The plan of this paper is as follows. In Section^ we define the notions of sec- 
ondary structure and context free grammar, and provide context free grammars for various classes 
of secondary structures considered in the paper. In that section, we show that the asymptotic 
number of secondary structures with annotated dangles, as computed in the partition function 



■^The shape of a secondary structure was defined by Voss et al. |44| to represent its branching topology; for 
instance, the shape of the well-known clover-leaf structure of tRNA is [[] [] []]. The asymptotic number of 
shapes for a length n sequence yields the run time for the Giegerich Lab software RNAshapes on length n sequences, 
since Steffen et al. 1401 report that RNAshapes runs in time 0{n^ks) for s sequences, each of length at most n and 
k shapes. 
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of the Markham-Zukcr software UNAFOLD gS], is 0.63998 • . 3.06039", exponentially larger 

than the number of all seeondary structures 1.104366 • rr^l"^ ■ 2.618034", previously established 
by Stein and Waterman [41]. This new result provides a partial explanation for M. Zukcr's ob- 
servation (personal communication) that UNAFOLD requires substantially more computation time 
when dangles are includedQ In Section [31 we describe the computation of secondary structure 
melting curves with respect to the Nussinov energy model and the base stacking energy model. 
In particular. Figure [3] shows that folding is more cooperative in the base stacking energy model. 
Additionally, we provide computational evidence using Monte Carlo kinetic folding experiments 
that folding time, as measured by mean first passage time (MFPT), is weakly correlated with the 
number of saturated structures, and highly correlated with minimum free energy. In Section [31 
we describe the correspondence between RNA secondary structures and plane trees, and then 
give generating functions for the number of secondary structures and locally optimal secondary 
structures, with respect to the Nussinov model and the base stacking energy model. In Section [5l 
we give asymptotic results on the number of secondary structures and locally optimal secondary 
structures, as well as their expected number of base pairs. In Section [6l we give similar asymp- 
totic results when annotations for external dangles are included for each type of structure. Finally 
Section [7| summarizes our main contributions. 

2. Definitions 

Definition 1 (Secondary structure). An RNA secondary structure for a given RNA sequence 
ai, . . . , a„ of length n is defined to be a set S of ordered pairs (i, j), with 1 < i < j < n, such that 
the following conditions are satisfied. 

: 1. Watson-Crick and wobble pairs: // (i, j) G S, then {a.,, aj} £ {{A, U}{G, C}{G, U}}. 

: 2. No base triples: // {i,j) and [i, k) belong to S, then j = k; if {i,j) and {k,j) belong to 
S, then i = k. 

: 3. Nonexistence of pscudoknots: If {i,j) and (fc,£) belong to S, then it is not the case that 
i < k < j < £. 

: 4- Threshold requirement for hairpins: // {i,j} belongs to S, then j ^ i > 9, for a fixed 
value 9 > 0; i.e. there must be at least 9 unpaired bases in a hairpin loop. 

For software, such as mfold [SD] and RNAfold [2D|, to predict RNA secondary structure, 9 is 
taken to be 3; i.e., for reasons related to steric constraints, every hairpin is required to contain at 
least three unpaired bases. 

A base pair (i,j) € S is called a link. An element i is said to be linked if it is involved in a 
link and free otherwise. A link («, j) is said to be stacked onto another link {i',j') if i' = i + 1 
and j' — j — I. A stem is a maximal sequence £o, . . . ,£k of links such that £i is stacked onto £i^i 
for < i < k — I; the value k is called the length of the stem. In some applications a threshold 
condition on stems is required: 

: 5. Threshold requirement for stems: Each stem has length at least t, for a fixed value 
r > 0. 

Note that Condition (5) is of no effect for r = 0. 

In this paper, we are concerned with the asymptotic number of locally optimal structures. In 
order to employ generating functions, we will need to assume the homopolymer model (following a 
convention established by Stein and Waterman [41j). meaning that any position can pair with any 
other position (arbitrary base pairs, not only Watson-Crick and wobble pairs). We thus define 
a secondary structure of a homopolymer of length n to be a set S of base pairs {i,j), where 
1 < * < .7 < such that the previous conditions (2,3,4,5) are satisfied. 

The following notion of context free grammar is used for two reasons: (1) to provide a clean and 
succinct definition for RNA secondary structure, with respect to a particular energy model, and 
(2) for certain enumeration results. See Lorenz et al. [55] for more on context free grammars and 
their application to combinatorics. In particular, we refer the reader to [26| for an explanation of 



Note that UNAFOLD is currently the only software that computes the partition function over all secondary 
structures in a mathematically rigorous manner. 
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the DSV method used in this article, which ahows us to go directly from a context free grammar 
to a functional equation for generating functions. 

Definition 2 (Context free grammar). A context free grammar is given by G = {V,Ti,R,S), 
where V is a finite set of nonterminal symbols (also called variables), T, is a disjoint finite set of 
terminal symbols, S V is the start nonterminal, and 

RcV X (VU^)* 

is a finite set of production rules. Elements of R are usually denoted by A ^ w, rather than 
iA,w). 

If x,y Cz (y U S)* and A ^ w is a rule, then by replacing the occurrence of A in xAy we obtain 
xwy. Such a derivation in one step is denoted by xAy =^g xwy, while the reflexive, transitive 
closure of is denoted =^q- The language generated by context free grammar G is denoted by 
L{G), and defined by 

L{G) = {w G S* : w}. 

Now, in the following sections, we give context free grammars for RNA secondary structures, 
including structures with explicitly annotated dangles. Using the correspondence between gram- 
mar and recursions for dynamic programming, each grammar corresponds to an algorithm for the 
partition function for secondary structures with respect to a different energy model - the Nussi- 
nov model, the base stacking energy model, the Turner model, the Turner model with a rigorous 
treatment of dangles, the Turner model with external dangles. For notational simplicity, we take 
6, the minimum number of unpaired bases in a hairpin loop to be 1 (see condition 4 of Definition 
[T]). It is not difficult to extend the grammar for any fixed value of 00 



Nussinov energy model. In [33], Nussinov and Jacobson describe a dynamic programming 
algorithm to compute the minimum energy structure for a simple energy model, in which each 
base pair constitutes an energy term of —1. 

It is well-known |26| that the following unambiguous grammar Gi generates all secondary 
structures of the homopolymer model with 9 = 1. Here Gi has start non-terminal symbol S, 
and terminal symbols •, ( , ) . The non-terminal symbol S generates all non-empty secondary 
structures by using the following grammar (or production) rulesQ 

S -> •|(S')|S'(5) 

Let S{z) denote the complex generating function S{z) ~ X^J^o ^nz"', where Taylor coefficient 
[z"]S{z) is the number s„ of secondary structures for a homopolymer of size n. By the DSV 
methodology [HI dS] , we have 

S{z) = S^z + zS + z'^S + z'^S^. 
Introducing the auxilliary variable u to count number of base pairs, we have 

S{z, u) = S = z + zS + uz^S + uz'^S'^ = Sk^nu'^z" 

n k<n 

where Sk^n denotes the number of secondary structures on a length n homopolymer, having exactly 
k base pairs. It follows that 



dS{z,u) 



du ^ f-r' 

n k<n 



hence 

(1) [z-f-^{z,l) = Y.ks,, 

k<n 



^This is done, for instance, in grammar G4 by replacing the rule • >g — > • by • >g — > where consists 
of 6 occurrences of • . 

''Our grammar Gi is equivalent to the "tree grammar nussinov78" from |39| . 
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Since 

[z"]^(z,l) = ^Sfe,„ 

is the number of secondary structures on a homopolymer of length n, it foUows that the asymptotic 
expected energy over aU secondary structures of a homopolymer of length n, with respect to the 
Nussinov energy model, is equal to —1 times the asymptotic expected number of base pairs 

(2) - hm ] ^ l ' . 

Base stacking energy model. In the base stacking energy model, an energy term of —1 is 
assigned to each base pair of structure 5, provided that («, j) has an outer stacking pair - i.e. 
provided that (i + 1 , j — 1) G S. The set of all secondary structures is generated by the context free 
grammar G2 with non-terminals S", T, start symbol S", and terminals •, ( , ) with the following 
rules: 

s •|S'.|r|S'r 

T ^ ( • ) I ) I (T) I iST) 

Here, the non-terminal S generates all secondary structures, while the non-terminal T generates all 
secondary structures, such that the first and last positions are base-paired together. By introducing 
auxilliary non-terminal T, we can count the number of stacked base pairs, as well as the number 
of base pairs. It is not difficult to show by induction that G2 is an unambiguous grammar that 
generates all secondary structures, hence is equivalent to the previous grammar Gi. 

By the DSV methodology [8l[26], the generating function S{z) = s„2:" satifies the following 
equations 

S{z) = S = z + zS + T + ST 

T{z) = T = z^T + z^ST + z^ + z^S. 

Introducing the auxilliary variables u, v responsible for counting the number of base pairs resp. 
number of stacked base pairs, we have 

S{z,u,v) = S = z + zS + T + ST 

T{z,u,v) = T = uvz'^T -\- uz'^ST -\- uz'^ + uz^S. 

Letting Sk,m,n denote the number of secondary structures on a length n homopolymer, having k 
stacked base pairs and m base pairs, we have 



n k,7n<n 



n k,m<n 



du 
hence 

(3) = 

k.ra<n 

Since S{z, 1, 1) is is the number of secondary structures on a homopolymer of length n, it follows 
that the asymptotic expected energy over all secondary structures of a homopolymer of length 
?i, with respect to the base stacking energy model, is equal to —1 times the asymptotic expected 
number of stacked base pairs, 

4 - lim / ' '/ . 

^ ' n^oo 5- 2,1,1) 
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Figure 1. Secondary structure together with various loops: stacked base pair, 
hairpin, bulge, internal loop, multiloop. 



Grammar for McCaskill algorithm. All thermodynamics-based RNA secondary structure 
prediction algorithms use the Turner nearest neighbor energy model [30[ I47| , which contains free 
energy parameters for base stacking, single nucleotide dangles, hairpins, bulges, internal loops and 
multiloops. These parameters are obtained by a least squares fit of UV absorption data in optical 

5'-AC-3' 

melting experiments. For instance, at 37° C the RNA-RNA stacking free energy of „, j., is 

5'-CC-3' 

—2.24 kcal/mol and that of „, is —3.36 kcal/mol |17]- Software such as infold of Zuker 

3 -GG-o 

[53] and RNAf old from Vienna RNA Package [21] use the Turner energy model, while alternative 
approaches, such as Pf old [24] use stochastic context free grammars. 

In |31| . McCaskill describes a cubic time, dynamic programming algorithm to compute the 
partition function Z — exp(— i?(S')/_Rr) over all secondary structures S' of a given RNA 
sequence. Here R is the universal gas constant, T is absolute temperature, and £'(5') is the energy 
of structure S with respect to the Turner energy model [47]. By analyzing McCaskill's recursions, 
we obtain the following grammar G3, which generates the same set of secondary structures as 
that generated by Gi, G2', however, by permitting the classification of various types of loops, the 
grammar G3 will permit us later to incorporate energy terms for dangles, also known as single- 
stranded, stacked nucleotides, into our considerations. A stacked base pair in secondary structure 
S is given by base pair € 5, such that (i — j + G S*. A hairpin loop in secondary structure 
S is given by base pair (i, j) € S, such that i + 1, . . . , j — 1 arc unpaired in S. A left bulge of 
S is given by base pairs {i,j), {k,£) G S, such that i + l<k<£<j and £ + I ~ j. A right 
bulge of S is given by base pairs («, j), (fc, £) g 5, such that i<k<£<j— 1 and i + 1 — k. An 
internal loop of S is given by base pairs {i,j), {k,£) G S, such that i + l<k<£<j~l; i.e. an 
internal loop is comprised of both a left and right bulge. A multiloop M of S is given by base 
pairs {i,j), {ki,£i), . . . , {kr,£r) G S, such that i < ki < £1 < ■ ■ ■ < k^ < £r < j , where r > 2, and 
positions i-fl,...,fci — l,^i-|-l,...,fc2 — l,^2 + l,---:fcr~l,^r + l,---7.7~l are all unpaired in S. 
For any positions i < x < y < j , where we do not require x or y to be base-paired, we say that the 
multiloop restricted to [x, y] has c components, if exactly c of the base pairs (fci, £1), . . . , (fc^, £r) 
are found in the interval [x,y]. See Figure [1] for an illustration of various loops, and see [52l [5Tj 
for more on loop classification and the Turner energy model. 

Let grammar G3 contain non-terminal symbols S (start), U (unpaired portion), B (base-paired 
portion), Afi (multiloop with exactly one component), M (multiloop with at least one component). 
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with the fohowing production rules 

S •\B\S»\SB 

B ^ (U)\(B)\(UB)\(BU)\( UBU ) | ( MMi ) 
U •{•U 

Ml B\BU 
M Mi\UMi\MMi 

It is not difficult to show that G3 is an unambiguous context free grammar equivalent to Gi , 6*2 , 
thus generates all secondary structures. The grammar G3 is equivalent to the "tree grammar 
wuchty98" as defined in [55], though notation is vastly different. 

Grammar for Markham-Zuker algorithm. The Markham-Zuker software UNAFOLD [5S] is 

the only current thermodynamics-based algorithm that computes the partition function for RNA 
secondary structures in a mathematically rigorous manner, including correct treatment of energy 
contributions from single-stranded, stacked nucleotides - also called dangles. By enlarging the set 
of terminal symbols, we describe here an unambiguous context free grammar G4, which generates 
all secondary structures with dangle explicitly given. M. Zuker (personal communication) has 
mentioned that the algorithm UNAFOLD may take approximately twice as long to run when the user 
chooses to include treatment of dangles. As we will later see, an explanation for this phenomenon is 
that the asymptotic number of secondary structures, where the dangle state is explicitly annotated, 
is much larger than the number of secondary structures. 

The context free grammar G4 has start symbol S, terminal alphabet {5, 3, • , ( , ) } and non- 
terminal alphabet {S,B, M, Mi, U, • >e} and rule set 

S •\S>\{e + S}B\{e + S}5B\{e + S}B3\{e + S}5B3 

B ^ ( •>e)\(B)\(UB)\(BU)\(UBU)\ 

( MMi ) I ( UIMi ) I ( MMi5 ) I ( 3A/Afi5 ) 
M ^ {e + U + M}Mi 
Ml Mi»\B\5B\B3\5B3 
•>e •\»>bU 
U •![/• 

Note that +,e are meta-symbols, used to express the rules more succinctly. For instance, 5' — > 
{e -f S}B is an abbreviation of the rules S B and S — > SB. The previous rules provide for 
an unambiguous context free grammar that generates all non-empty secondary structures, where 
all dangles are explicitly annotated. For instance, 5 ( • • • ) indicates that in the secondary 
structure • (•••), the first position is single-stranded nucleotide which is 5' to the position 2, 
and stacks on the base pair (2, 6). Similarly, ( • • • ) 3 indicates that in the secondary structure 
( • • • ) • , the last position is single-stranded nucleotide which is 3' to the position 5 and stacks 
on the base pair (1,5). Since the Turner energy parameters for hairpins, bulges and internal loops 
already include contributions for single-stranded positions within the loop which may dangle on 
the outer, closing base pair, it follows that in thermodynamics-based structure prediction, we do 
not consider internal dangles in hairpins, bulges or internal loops of the form (3---), (■■■5), 
( 3 • • • 5 ) , though such internal dangles are considered in multiloops. Of course, external dangles 
of the form 5(---), (•••)3 and 5 ( • • • ) 3 are considered for all types of loops. 

In the grammar G4, non-terminals represent the following: S denotes the start symbol to 
generate all structures, B indicates that the leftmost and rightmost positions are paired together, 
M denotes a substructure located within a multiloop, having at least one component (the base 
pair closing the multiloop has been generated before non-terminal M) , Mi denotes a substructure 
located within a multiloop, having exactly one component, where additionally the leftmost position 
is paired with a position in the substructure to the right (though not necessarily the rightmost 
position). 
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Note that the Markham-Zuker approach aUows dangle annotations of the form ( BB5 ) and 
( BBS ) ; i.e. where a single-stranded position occurring between two closing parentheses can be 
annotated as either a 5'-dangle, 3'-dangle, or no dangle. Indeed, 

S^B^ (A/A/i5) (A/iA/i5) (BMi5) => (BB5) 

and 

S=^B^ (MMi) (A/iMi3) ^ (BA/i3) ^ (BBS) 

and 

S^B^ (MMO =^ (MiMi) (BMO ^ (BB) 

Theorem 1. In the homopolymer model, where the minimum number of unpaired bases in a 
hairpin loop is 1, the asymptotic number of secondary structures with annotated dangles, following 
the Markham-Zuker recursions in |29| is 

Sn ~ 0.63998 • . 3.06039". 

Proof sketch: It is not difficult to prove by recursion on n that the set of dangle-annotated 
secondary structures of length n generated by grammar G4, is equal to the value of the Markham- 
Zuker partition function described in pages 14-16 of [29], provided that all energies are set to 00 
Now apply DSV methodology and analyze the dominant singularity using the Flajolet-Odlyzko 
theorem, as fully described in |26| . In supplementary material, we provide a detailed computation 
using Mathematica. □ 

Grammar for external dangles. Define external dangle to mean a 5'-dangle, which occurs 
to the immediate left of an opening parenthesis, or a 3'-dangle, which occurs to the right of a 
closing parenthesis. Since our work is theoretical in nature, in the construction using plane trees 
in Section [SI we choose to consider external dangles for all loops, not simply external loops and 
multiloops. We now give a context free grammar for secondary structures having possible 5-dangles 
and 3-dangles in bulges, internal loops, multiloops and external loops. 

Let G5 be a context free grammar with start symbol S, terminal alphabet {5, 3, • , ( , ) } and 
non-terminal alphabet {S, B, M, Mi, U, • >g} and rule set 

S •\S»\{e + S}B\{e + S}5B\{e + S}B3\{e + S}5B3 

B ^ ( •>g)\(iB)\({5 + U5 + U}B)\(B{3 + 3U + U})\ 

({5 + U5 + U}B{3 + 3U + U})\( MAh ) 
M {e + U + M}Mi 

Ml Mi»\B\5B\BS\5B3 
•>e •|»>eC/ 
U •![/• 

As in grammar G4, the symbols -|-,e are meta-symbols, to permit a concise representation of 
grammar rules; moreover, the meaning of non-terminals S, B, M, Mi is the same in G5 as in G4. 
It can be proved by induction that grammar G5 is an unambiguous context free grammar, that 
generates all non-empty secondary structures with explicitly annotated 5-dangles and 3-dangles, 
i.e. those dangles that are external to any type of loop, whether the loop is a hairpin, bulge, 
internal loop, multiloop or external loop. 

Theorem 2. In the homopolymer model, where the minimum number of unpaired bases in a 
hairpin loop is 1, the asymptotic number of secondary structures with annotated external dangles, 
generated by grammar G5 is 

Sn 0.96691 • . 3.079596". 



It is clear that the number of structures equals the partition function J^g exp(— i?(S')/i?T) provided that 
E{S) = 0. 
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Proof sketch: Using the DSV methodology, we analyze the dominant singularity using the 
Flajolct-Odlyzko theorem, as fully described in |26] . In supplementary material, we provide a 
detailed computation using Mathematica. Moreover, in the latter part of this paper, in a self- 
contained manner, we give an alternate proof using plane trees. □ 

Grammar for saturated structures. In [8j, we presented the following grammar which gener- 
ates all saturated secondary structures in the sense of Zuker [49j ; i.e. locally optimal with respect 
to the Nussinov energy model. Let Gg be the context-free grammar with nonterminal symbols 
S, R, terminal symbols •, ( , ) , start symbol S and production rules 

S •{••\R»\R»»\(iS)\S (S) 

R (S)\R(S) 

It can be shown by induction on expression length that L{S) is the set of saturated structures, 
and L{R) is the set of saturated structures with no visible position; i.e. external to every base 
pair. Here, position i is said to be visible in a secondary structure T if it is external to every base 
pair of T; i.e. for all {x,y) G i < x ot i > y. 

It is possible to describe context free grammars that generate (1) all secondary structures, 
(2) all saturated secondary structures, (3) all G-saturated secondary structures, optionally with 
annotated external dangles. However, the subsequent analysis of dominant singularity becomes 
increasingly arduous. For this reason, beginning in Section |4l we present a new, unified method 
using duality, marked plane trees, substitution of generating functions, and the Drmota-Lalley- 
Woods theorem (see Theorem [3]). 

3. COMPUTATIONAL RESULTS 

In this section, we present computational results to highlight differences between the Nussinov 
model and the base stacking energy model, and additionally to determine the relation between 
folding time and number of saturated structures. Figure [3] displays a melting curve with respect 
to the Nussinov energy model and the base stacking energy model. By extending ideas we first 
described in [3], we developed two algorithms (one for the Nussinov model and one for the base 
stacking energy model), each running in time O(n^) and space 0{n^), to compute the expected 
number of base pairs as a function of temperature. Figure [3] clearly shows that the melting 
temperature Tm, depends on the energy model, where Tm is defined as the temperature at which, 
on average, half the base pairs of the high temperature structure are no longer present. Moreover, 
as the figure shows, the base stacking energy model leads to more cooperative folding, as signified 
by the sigmoidal nature of the curve (see Dill and Bromberg [9] for a discussion of cooperative 
folding) . 

Additionally, the Nussinov energy model and the base stacking energy model are remarkably 
different with respect to pseudoknotted structures, defined by dropping requirement (3) in our 
definition of secondary structure; i.e. a pseudoknotted structure S allows base pair crossings of 
the form {i,j),{k,() £ S, where i < k < j < i. While Tabaska et al. [42] showed that the 
minimum energy pseudoknotted structure can be computed in cubic time 0{n^) by using the 
maximum weighted matching algorithm, provided one considers the Nussinov energy model, in 
the preprint [38] . Sheikh et al. show that determination of the minimum energy pseudoknotted 
structure for the base stacking energy model is A^P-complete, a refinement of a result of Lyngs0 
and Pedersen [27]. 

In Tables [Tj and [21 we present computational results that suggest that the folding time is strongly 
correlated with the minimum free energy and weakly correlated with the number of saturated 
structures]! For the results summarized in Table [U we took 61 RNA sequences from the Rfam 
family RF00031 seed alignment [15]. For each RNA sequence, the following were computed: (1) 
sequence length, (2) minimum free energy (MFE) using RNAf old from the Vienna RNA package 
[l9] . the GC-content, defined as the number of G and C nucleotides divided by sequence length, 

secondary structure S for a given RNA sequence s = si,...,Sn S {A,C,G,U}* (not homopolymer) is 
saturated, provided that for any base pair 5, the structure T = SU{(i,j)} is not a valid secondary structure 

for s; i.e. one of conditions 1,2,3,4 of Definition [l] is violated by T. 
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Melting curves for two simple energy models 




28 I 1 1 1 1 1 1 1 

-300 -200 -100 100 200 300 400 

temperature in Celsius 

Figure 2. Theoretical melting curve for two simple energy models of RNA sec- 
ondary structure. Temperature in Celsius is given on the x-axis, while expected 
number of base pairs is given on the y-axis. We implemented an algorithm, using 
dynamic programming, with run time O(n^) and space 0(n^), to compute the 
partition function Zk = X^seSt exp{—E{S)/RT), where {S)k denotes the set of all 
secondary structures for a homopolymer of length 100 nt, having exactly k base 
pairs. The expected number of base pairs is thus X]fc k-pk, where pk = ^ denotes 
the probability that a secondary structure has k base pairs, and Z denotes the full 
partition function Z = exp(— i?(5)/i?r) — J^k In the Nussinov-Jacobson 
energy model [33], E{S) is defined to be —1 • l^l; i.e. —1 times the number of base 
pairs of 5*. In the base stacking energy model, E{S) is defined to be —1 times the 
number of stacked base pairs of S. Although both models are quite similar, we 
see that the melting curves are indeed different, where the base stacking model 
entails more cooperative folding (see [3] for discussion of cooperative folding). 



the number of saturated structures as computed by our own program RNALOSS [3l |4] . Additionally, 
for each sequence, the log base 10 of mean first passage time, taken over 50 Monte Carlo kinetic 
folding experiments was computed. Here we used an unpublished program written by E. Freyhult 
and P. Clote, which starts from the empty structure, and by a Monte Carlo Metropolis method 
either adds or removes a single base pair from the current structure until the minimum free energy 
(MFE) structure is found. 

Table [1] shows a number of correlations: (1) as G-C content increases, the minimum free energy 
decreases - this is because GC stacked base pairs contributed a lower energy, i.e. negative with 
larger absolute value, than do AU or GU stacked base pairs; (2) as sequence length increases, 
so does the number of saturated structures; (3) lower MFE corresponds to faster folding, an 
observation made for a toy model of protein folding in seminal work of Sali et al. [35] ; (4) as the 
number of saturated structures increases, so does the folding time, as measured by MFPT. 

The selenocysteine insertion sequence (SECIS) RNAs from Rfam family RF00031 have approx- 
imately the same sequence length of 64.32 ± 2.83 nt, and approximately the same GC-content of 
49.69 ± 10.65%. Though these values are approximately the same for each sequence, in order to 
entirely remove the influence of both sequence length and GC-content, we computed the same 
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Icn 


MFE 


GC-content 


numSatStr 


loglOMFPT 


Icn 


1 










MFE 


-0.3685 


1 








GC-content 


0.0796 


-0.7929 


1 






numSatStr 


0.4437 


-0.3269 


0.1450 


1 




loglOMFPT 


0.4059 


0.3990 


-0.3259 


0.3587 


1 



Table 1. Pearson correlation between various aspects of selenocysteine inser- 
tion sequences from the seed alignment of Rfam family RF00031 [TS]. For each 
RNA sequence, the sequence length (len), minimum free energy (MFE), percent- 
age of G-l-C nucleotides (GC-content), number of saturated structures (numSat- 
Str). Additionally, for each sequence, the log base 10 of mean first passage time 
(loglOMFPT), taken over 50 Monte Carlo kinetic folding experiments was com- 
puted, along with the log base 10 of the standard deviation (not shown, but 
approximately 9% of loglOMFPT on average). The table shows the correlation 
between each of these aspects. Some correlations are obvious, or follow from 
known results or results proved in this paper - for example, as sequence length 
increases, we can expect the number of saturated structures to increase exponen- 
tially in the sequence length [5]. Nevertheless, the average sequence length over 
this RF00031, consisting of 61 sequences, is 64.32 ± 2.83, so the sequences are 
approximately all of the same size. Note that the logarithm of the mean first 
passage time (i.e. folding time) increases as the number of saturated structures 
increases. 





MFE numSatStr loglOMFPT 


MFE 

numSatStr 
loglOMFPT 


1 

0.1515 1 

0.9281 0.1770 1 



Table 2. Pearson correlation for 50 random RNA, each of length 50 nt and of 
GC-content 48.39%, generated by applying the Altschul-Erikson algorithm [U [7] 
to the RNA with EMBL accession code S79854.1/1605-1666, taken from the Rfam 
family RF00031 seed alignment [15]. 



values as previously described, but in place of 61 RNAs from RF00031, we generated 50 RNA se- 
quences, each of length 50 nt, by applying the Altschul-Erikson algorithm [1] [7] to the RNA with 
EMBL accession code S79854. 1/1605-1666, taken from the Rfam family RF00031 seed alignment 
}15j . Since the Altschul-Erikson algorithm generates random sequences, each having exactly the 
same number of each nucleotide A,C,G,U and of each of the 16 dinucleotides, it is often used to 
generate control sequences when using RNA secondary structure algorithms [7]. In this fashion, 
each random sequence has GC-content of exactly 48.39%. Table [2] indicates a strong correlation 
between minimum free energy (MFE) and mean first passage time (MFPT), as well as a weaker 
correlation between the number of saturated structures and the MFPT0 

4. Enumeration of locally optimal secondary structures 

4.1. Duality: RNA secondary structure -H> weighted plane tree. It is well known that 
secondary structures have a tree shape, and there are several ways to formulate it. Here we 
find convenient to associate in a bijective way to a secondary structure (in the homopolymer 
formulation) a rooted plane tree with nonnegative integers (weights) at the corners and at the 



In data not shown, wc performed a similar eomputational experiment where 50 RNA sequences were generated, 
each of length 50 nt, each having expected GC-content of 49.69 (using a 0th order Markov chain, or coin flipping). 
Results were similar as those from Table [2] 
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b) c) d) 

Figure 3. (a) A (homopolymer) secondary structure, (b) deformed into a tree- 
like shape, (c) the reduced structure superimposed with the dual rooted plane 
tree (in dashed lines, with the root indicated by an ingoing arrow), (d) the rooted 
plane tree with weights at corners (surrounded by circles) to indicate segment 
lengths, and weights at edges (surrounded by squares) to indicate stem lengths. 



edges. The transformation is shown in Figure [31 Start with a secondary structure S of length n, 
the elements in the sequence being ranked from 1 to n. Call segment of S' a sequence i, i + 1, . . . , j 
such that i < j and: (i) either 1^=0, orl<i<n and the element i is linked, (ii) either j = 7i + 1, 
or 1 < j < n and the element j is linked, (iii) all elements in z + 1, . . . , j — 1 are free. Note that 
there are j — i — 1 free elements in the segment. Then perform two reduction operations on S: 

Stem- reduction: Replace each stem Iq, . . . by a single link. 

Segment-reduction: Replace each segment by a unit segment (with no free element on it). 

Call R the reduced structure (which has no free element). Given the standard plane repre- 
sentation of R, draw a vertex, called a dual vertex in each region, and for each link of R, draw 
a dual edge connecting the vertices in the regions on each side of the link. The obtained figure 
(keeping the dual vertices and dual edges only) is a rooted plane tree T. Note that each edge of T 
corresponds to a link of R (hence corresponds to a stem of S), and each corner of T corresponds 
to a segment of S. We weight T by giving to each of its corners a weight corresponding to the 
number of free elements in the corresponding segment, and giving to each of its edges a weight 
corresponding to the length of the corresponding stem. Several parameters are in correspondence 
through the bijeetion (we use the standard terminology for parameters of secondary structures, 
a node of a tree is called a leaf if its arity is and an inner node if it has positive arity): See 
Table [3] for a summary of the correspondences between secondary structure loops and nodes of a 
weighted tree. 
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secondary structure S weighted tree T 

hairpin <-> leaf 
bulge -;-> inner node with one child 
multiloop -!->■ inner node with several children 
segment with L free elements ^ corner of weight L 
stem of length k o edge of weight k 
Table 3. Correspondence between types of loop in secondary structure S and 
types of node in the plane tree T obtained by duality. 



Note also that the number of links of S is the number \E\ of edges plus the total weight 
over all edges, and that the number of free elements of S is the total weight Wc over all corners, 
hence the length n of 5 satisfies n = 2\E\ + 2 We + Wc- 

A weighted rooted plane tree with at least one edge is called admissible if it corresponds to a 
valid secondary structure (which has at least one link), i.e., if the weights satisfy the following 
conditions: 

: 1. Each non-root node with one child has at least one of its two incident corners of positive 

weight (otherwise the stem-reduction would not have been complete). 
: 2. Each corner at a leaf has weight at least 6 (to satisfy the 6'-threshold condition). 
: 3. Each edge has weight at least r (to satisfy the r-threshold condition). 

4.2. Generating functions. For r > 1, a weighted combinatorial class indexed by r parameters is 
a set A together with a weight- function W from ^ to M and r parameter- Junctions Pi, . . . , Pr (one 
for each parameter) from ^ to N such that for any fixed integers tii, . . . , n^, the set of structures 
•y E A such that -Pi (7) = ni, . . . ,-Pr(7) — n-r is finite. This set is denoted ^[ni, . . . ,nr]. The 
corresponding multivariate generating function is 

(5) Aixi,...,xr) ^xf^^^ W(7). 

We say that variable Xi marks the parameter Pi, for 1 < i < r. We also use the notation 

[x'^^ ...x:^-]Aixi,...,Xr):= 

7E-4[ni,...,nr] 

In general we consider enumerative generating functions, where VF(-) assigns weight 1 to each 
structure. However we allow ourselves to weight these structures, e.g., to weight each secondary 
structure by with p a so-called stickiness parameter. The variables Xi are a priori con- 

sidered as formal, but one can also evaluate a generating function at given values, provided the 
sum converges. The convergence domain of Aixi, . . . ,Xr) is the set of r-tuples [xi, . . . ,Xr) of 
nonnegative real values such that A{xi, . . . , Xr) converges. 

As a first example, we briefly recall here how to enumerate (homopolymer) secondary structures, 
via the dual representation by weighted rooted plane trees and using generating functions. Let 
be the family of rooted plane trees, possibly reduced to a single vertex, with some marked corners 
(to be occupied by positive weights later on) incident to inner nodes such that each node with 
one child has at least one marked corner. Let F = F{u,v,x) be the generating function of !F 
where u marks the number of leaves, v marks the number of marked corners, and x marks the 
number of edges. When the root- node v has arity 1, exactly one of its two corners is marked, 
hence the generating function for trees in whose root-node has arity 1 is 2vxF. When the 
root-node v has arity k > 2, there are (fc -f 1) corners incident to v, and each of these can be 
marked (independently). Hence the generating function for trees in J- where the root- node has 
arity fc is (1-1- v)^^"^ x'^ F^ . Consequently, F satisfies 

(6) F = u + {2v + v^)xF + ^ x'^il + vf+^F'' = u+ ^ ^^^^ - xF. 

k>2 l~x{l-^v)F 
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Let Q be the family of rooted plane trees with at least one edge and with some marked corners 
(to be occupied by positive weights later on) incident to inner nodes such that each non-root node 
with one child has at least one marked corner. Let G = G{u, v, x) be the generating function of 
Q where u marks the number of leaves, v marks the number of marked corners, and x marks the 
number of edges. Again by decomposing at the root, we get 



(7) G=} x^l + vf+'R 



V- ' - l^x(l+v)F' 

k>l ^ ' 

Let s) be the series counting secondary structures with at least one link, where t marks the 
number of free elements, and s marks the number of links. Note that g(t, s) is also the generating 
function of admissible rooted weighted plane trees where t marks the total weight over corners, 
and s marks the number of edges plus the total weight over edges. Such a tree is uniquely obtained 
from a tree in Q where each corner at a leaf is assigned a weight of value at least 0, each non- 
marked corner at an inner node is assigned weight 0, each marked corner is assigned a positive 
weight, and each edge is assigned a weight of value at least t. Hence we have g{t,s) = G(\J ,V,X), 
where 



i>e i>T 
To summarize, we have an expression (written as a system of two equations) for the generating 
function g{t, s) enumerating secondary structures with at least one link, where t marks the num- 
ber of free elements and s marks the number of links (the generating function of all secondary 
structures, including the ones with no link, is clearly g{t, s) + t + t"^ + ■ ■ ■ = g{t, s) + jz^)- In- 
deed, if we define f{t, s) := F{U, V, X), then we easily see (since the substitutions of variables are 
rational expressions whose series-expansion have nonnegative coefficients) that there is a one-line 
equation specifying f{t,s), of the form f{t,s) = Q{t^ f(t, s)), with Q = Q{t,s,y) a rational 
expression whose series-expansion (in s, t, y) has nonnegative coefficients. And there is a ratio- 
nal expression R = R{t,s,y) whose series-expansion has nonnegative coefficients and such that 
g{t, s) = R{t, s, f{t, s)). Precisely 

^ , . ( t'^ t s"+i\ . x{l + vfy 

Q ~ substitute u = , v = , x = into u H ; ^ xv, 

^ \ 1-t' 1-s) l~x{l + v)y 

„ , . , t s^+i\ . x(l+vfy 
R — substitute V = , x — into 



l-t l-sj l-x{l + v)y 

This allows us to extract the counting coefficients. Let 5p(i) be the weighted generating function 
of secondary structures where t marks the length, and where each structure has weight 
gp{t) — g{t,pt'^) + t/{l — t) (the term i/(l — t) gathers secondary structures with no link); for 
instance for 9 = \ and r = we find 

g^it) = t + t2 + (i+p)t3 + (i + 3p)i4 + (i + 6p-f/)t^ + (l + 10p4-6/)t^-l-(l4-15p+20/-f/)i^ + - • • . 

4.3. Counting saturated structures. The Nussinov energy E{S) of a secondary structure S is 
defined as E{S) — -~L{S), with L{S) the number of links in 5*. A secondary structure 5* is called 
saturated (or locally optimal for the Nussinov energy) if it is not possible to add a link to S (i.e., 
decrease the energy by 1) while keeping a valid secondary structure. 

Lemma 1. Assume r = (no restriction on the lengths of stems). Saturated secondary structures 
with at least one link correspond to admissible weighted rooted plane trees such that: 

• all corners have weight at most 9 + 1, 

• at each node there is at most one comer of strictly positive weight. 

As shown in Figure 31 if there are two positive corners at the same inner node, then it is possible 
to add a link. Also, if there is a corner with weight at least 0-1-2 then one can link the first and 
last free elements in the corresponding segment. Hence the weight of each corner is at most 9 + 1. 
And these are the only two situations where it is possible to add a link without breaking planarity 
nor breaking the ^-threshold condition. □ 
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Figure 4. Situations where it is possible to add a link to a secondary structure. 



Call T the family of rooted plane trees with some marked corners incident to inner nodes (these 
marked corners are to be occupied by positive weights later on) such that: (i) each node with one 
child has exactly one marked corner, (ii) each node with several children has at most one marked 
corner. Let F = F(u,v,x) be the generating function of where u marks the number of leaves, 
V marks the number of marked corners, and x marks the number of edges. When the root-node v 
has arity 1, exactly one of its two corners is marked, hence the generating function for trees in 
whose root-node has arity 1 is 2vxF. When the root-vertex v has arity fc > 2, there are {k + 1) 
corners incident to v, and at most one of these corners has positive weight. Hence the generating 
function for trees in T where the root- vertex has arity fc is (1 -I- (fc -f l)v)x''F''. Consequently, F 
satisfies 

F = u + 2vxF + ^(1 + (fc + 1)v)x''F'', 

k>2 



Hence, using the identity J2k>o(^ + ^)'^^ = 1/(1 ^ ^)^' ^ satisfies 

(8) F = u+ h- 

^ ' l-xF {I- xFf 

Now let Q be the family of rooted plane trees with at least one edge, and with marked corners 
incident to inner nodes such that: (i') each node v with one child has exactly one marked corner if 
V is different from the root-node, and has at most one marked corner if v is the root-node, (ii) each 
node with several children has at most one marked corner. Let G = G{u,v,x) be the generating 
function of Q where u, v, x mark respectively the numbers of leaves, marked corners, and edges. 
Decomposing again at the root, we get 

xF V 



(9) G = ^(l + (fc + l)u).x''F'^ = 



1-xF (1 - a-F)2 

k>l ^ ' 

We take here r = (no restriction on the lengths of stems). Let g{t,s) be the generating 
function of saturated secondary structures with at least one link, where t marks the number of 
free elements and s marks the number of links. Then Lemma [T] ensures that g{t, s) = G{U,V, X), 
where 

U = t'{i + t), y = t + ... + t«+i = i-^, x^-^. 

1 — t 1 — s 

To summarize (in a similar way as for general structures), we have an expression (written as 
a system of two equations) for the generating function g(t, s) enumerating saturated secondary 
structures with at least one link, where t marks the number of free elements and s marks the 
number of links (the generating function of all saturated secondary structures, including the ones 

with no link, is g{t,s) + t'] hi''+^ = g{t, s) + *-^^). Indeed, if we define f{t,s) := F{U,V,X), 

then there is a one-line equation specifying f{t,s), of the form f{t,s) = Q{t, s, f(t, s)), with 
Q(t, s, y) a rational expression whose series-expansion (in s, t, y) has nonnegativc coefficients. 
And there is a rational expression R{t,s,y) whose series-expansion has nonnegative coefficients 
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Figure 5. Situations where it is possible to extend a stem of a secondary structure. 



and such that g{t, s) 

Q — substitute iu~t^{l 



R{t,s, f{t,s)). Precisely 

t - 



R = substitute 



t),v 

t - 



1 - 
^e+2 



t 



into u 



into 



xy 



9 2 

x^y'^ 
l~xy 

V 



[l-xyY' 



1 — t' I — s J I — xy (1— xy)^ 

Again this allows us to extract the counting coefficients. Let gp{t) be the weighted generating 
function of saturated secondary structures where t marks the length, and where each structure 
has weight ^^^^^ ^ g[t,pt^) + t + h t^+^\ for 6* = 1 and r = we find 

gp{t) ^t + f+pt^ + Spt^ + (4p + p2)<5 _^ + Qp^^t^ + {I7p^+p^)f + ■■■ . 

4.4. Counting G-saturated structures. The base stacking energy E{S) of a secondary struc- 
ture S is defined as £'(5') := —T{S), with T{S) the sum of sizes of all stems of S. A (homopolymer) 
secondary structure is called G-saturated (locally optimal for the base stacking energy) if it is not 
possible to add a link so as to extend a stem (i.e., decrease by 1 the base stacking energy). In 
general, the addition of a link to a secondary structure either creates a new stem of length or 
extends an already existing stem. Hence, in a G-saturated structure a valid link addition always 
creates a new stem of length 0. In case t > 0, creating a stem of length is not a valid link addition 
(since the stems must have positive length), hence no valid link addition to a G-saturated is pos- 
sible for r > 0. In other words, the concepts of saturated and of G-saturated structures coincide 
when r > (whereas for r = the class of saturated structures is strictly contained in the class 
of G-saturated structures). In this section we enumerate G-saturated structures according to the 
number of free elements and the number of links, for any given values of the threshold parameters 
T and 9. 

Again we formulate the conditions on the dual representation. For this purpose we define 
adjacency of corners. Two corners c and c' of a rooted plane tree T are called adjacent if they 
arc incident to the same vertex ?; of T and there is an edge e incident to v such that c and c' are 
the corners incident to v on each side of e. Note that the two corners on each side of the root 
(the root is represented as an ingoing arrow in Figure are considered as adjacent only when the 
root-node v has arity 1 (in which case they are adjacent through the unique edge incident to v). 

Lemma 2. The G-saturated secondary structures with at least one link correspond to admissible 
weighted rooted plane trees such that: 

• the corners at leaves have weight at most 9 -\- 1, 

• any two adjacent corners can not both have strictly positive weight. 

As shown in Figure [5l if there are two adjacent positive corners, then it is possible to add a 
link so as to extend an existing stem. Also, if there is a corner of weight at least 9 -\-2 a.t a. leaf £, 
then one can link the first and last free elements in the corresponding segment and thus extend 
the stem associated to the edge leading to i. Hence the weight of a corner at a leaf is at most 
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6 + 1. And these are the only two situations where it is possible to extend a stem without breaking 
planarity nor breaking the 0-threshold and r-threshold condition. □ 

Call J- the family of rooted plane trees with some marked corners incident to inner nodes (again 
these marked corners are to be occupied by positive weights later on) such that: (i) each inner 
node with one child has exactly one marked corner, (ii) two corners can not both be marked if 
they are adjacent or if they are the two corners on each side of the root (the root is indicated by 
an ingoing arrow in Figure Let F = F(u, u, x) be the generating function of J- where u marks 
the number of leaves, v marks the number of marked corners, and x marks the number of edges. 
Finding an equation satisfied by is a little more involved than for saturated structures. At first 
we need a preliminary study on independent sets (i.e., sets containing only pairwise non-adjacent 
elements) on a fc-sequence or on a fc-cycle. 

For A; > and m < k, let Ck^m (resp. Sk,m) be the number of ways of choosing m marked 
elements on the oriented cycle (1,2,..., k) (resp. sequence 1, 2, . . . , fc) of fc elements such that no 
two consecutive elements are marked, and let Ck = Ck{v) := J^m'^k.m.v"'' (resp. Sk — Sk{v) :~ 
X^m ■^k,mv"^) be the corresponding (polynomial) generating function. The polynomials Sk are well- 
known to be the Fibonacci polynomials and satisfy an easy recurrence which we briefly recompute 
here. We take the convention 5*0 = 1. Let k > 2. If an independent set on the fc-sequence starts 
with a marked element, then the next element is forbidden and the remaining (fc — 2)-sequence 
might be occupied by any independent set; this gives a contribution vSk-2 in Sk^ where the 
factor V takes account of the first element being marked. If an independent set on the fc-sequence 
starts with a non-marked element, then the remaining (k — l)-sequence might be occupied by any 
independent set; this gives a contribution Sk-i in Sk- Therefore 

5**; = vSk-2 + Sk-i for fc > 2, So ^ I, Si = I + V. 

Now define S = S{v,z) := X]a;>o '^'t(^)'^'^- "^^^ recurrence on Sk above multiplied by z'' and 
summed over k > 2 yields S — So — zSi = vz^S + z{S — So)- With 5*0 = 1 and Si = I + v we 
obtain 

_ 1 + vz 

o — „ . 

1 — z — f 

Let us go back to independent sets on the fc-cycle (1, . . . , fc), for A; > 3. In such a set, either 1 is 
occupied, in which case the adjacent elements 2 and k are unoccupied and the remaining segment 
3, . . . , fc — 1 might be occupied by any independent set. This gives contribution vSk-3 to Ck- If 
1 is unoccupied, then the remaining segment 2, . . . , fc might be occupied by any independent set; 
this gives contribution Sk-i to Ck- Consequently 

Ck = vSk-3 + Sk-i for k>3. 

If the root-node v of a tree in T has arity 1 then exactly one of its two incident corners is marked 
(by definition of J^), thus the generating function of trees in T whose root-node has arity 1 is 
2vxF: if V has arity k > 2 then the marked corners around v form an independent set (no two 
consecutive corners are marked). Thus, for k > 2, the generating function of trees in J- whose 
root-node has arity k is Ck+i{v)x'' F'^ (since there are fc + 1 corners incident to the root-node). 
Consequently F satisfies 

F = u + 2vxF + J2'^k+iiv)x''F'' 

k>2 



2vxF + Y,[^Sk-2 + Sk]x''F'' 



k>2 
„2 



= u + 2vxF + vx F^S{v, xF) + {S{v, xF)-l-{l + v)xF) . 

Using the rational expression of S above and rearranging, we obtain 

1 + 2vx'^F'^ ■ (1 + vxF) 

(10) F = u + 2vxF+ ^ ^ ' -xF^l. 

1 — xr — vx'^r ^ 

Now let Q be the family of rooted plane trees with at least one edge and where some corners at 
inner nodes are marked such that (i) each non-root inner node of arity 1 has exactly one marked 
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corner, (ii) two adjacent corners can not both be marked. And let G = G{u, v, x) be the generating 
function of G where u marks the number of leaves, v marks the number of marked corners, and x 
marks the number of edges. The difference between Q and is at the root-vertex: in Q the two 
corners on each side of the root are allowed to be both marked when the root- vertex has more 
than one child, and are allowed to be both unmarked when the root-vertex has one child. So we 
have 

G = J2Sk+i{v)x''F''. 

k>l 

Using the rational expression of S above and rearranging, we obtain the following expression for 
G in terms of F: 

xF{l + 2v + {l+v)vxF) 



(11) G 



1 — xF — vx'^F^ 



Now let g{t, s) be the generating function of G-saturated structures with at least one link, where 
t marks the number of free elements and s marks the number of links. By Lemma [21 

(12) g{t,s) = GiU,V,X), 

where 



f/ = r(i + i), v = - — , X 



l-t l-s 

The conclusion is similar to the other two cases (general structures, saturated structures): 
we have an expression (written as a system of two equations) for the generating function g{t, s) 
enumerating G-saturated secondary structures with at least one link, where t marks the number 
of free elements and s marks the number of links (the generating function of all G-saturated 
secondary structures, including the ones with no link, is g{t,s) + t + t'^ + ■ ■ ■ = g{t,s) + jz^)- 
Indeed, if we define f{t,s) F{U,V, X), then there is a one-line equation specifying f{t,s), of 
the form /(t, s) = Q{t, f{t, s)), with Q{t,s,y) a rational expression whose series-expansion (in 
s, t, y) has nonnegative coefficients. And there is a rational expression R(t, s, y) whose series- 
expansion has nonnegative coefficients and such that g[t^ s) — R{t, s, f{t, s)). Precisely 

^ , . / , t s^+i\ . „ l + 2vx^y'^{l + vxy) 

Q ~ substitute u = t [1 + t)^v ^ , x = into u-t-zwxyH 1— a^y, 

1 — t 1 — sy 1 — xy — vx^'y' 



„ -, . / t s^+i \ . xy(l + 2v + (l + v)vxy) 

R = substitute V = ,x = into — 5-^5 . 

\ I — t l^s/ ^ — xy — vx'^y' 

Again this allows us to extract the counting coefficients. Let gp{t) be the weighted generating 

function of G-saturated secondary structures where t marks the length, and where each structure 

has weight ^#(""^8). ^ g[t,pt^) + t/(l - t); for 6* = 1 and t = we find 

<7p(t) =i + t2 + (l+p)t3 + (l + 3p)<^ + (l + 4p + p2)t5_^(l + 4p + 6p2)i6 + (l + 4p+17p2^p3)^7_^... 

5. Asymptotic results 

5.1. Asymptotic enumeration. We show here that the number of structures of length n follows 
a universal asymptotic behaviour in C7"n~'^/^ (with c and 7 explicit positive constants), which is 
typical of tree-structures. The proof classically relies on the Drmota-Lalley- Woods theorem [HI 
VII. 6], which we recall at first. Consider an equation of the form 

(13) a[t) = ^{t,a{t)), 

where ^{t,y) is a rational expression in t and y. Such an equation is called admissible if the 
following conditions are satisfied: 

• the rational expression $(i, y) has a series-expansion in t and y with nonnegative coeffi- 
cients, is nonaffine in y, and satisfies $(0, 0) = and $.y(0, 0) = 0, 

• the unique generating function y = a{t) solution of (jl3p is aperiodic, i.e., can not be 
written as a{t) ~ t'^a{tP) for some integers p, q with p>2. 



use the subscript notation for partial derivatives. 
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There is an easy criterion to check the aperiodieity condition: it suffices to prove that there is 
some no such that [i"]a(i) > for n> uq. 

Theorem 3 (Drmota-Lalley-Wood). Let y = a{t) be the generating function that is the unique 
solution of an admissible equation y = ^(t,y). Then 

[t"]a{t) - c7"ri-3/2^ 

where 7 = l/ta, with (to,2/o) the unique pair in the convergence domain of ^{t,y) that is solution 
of the singularity system; 

y = <^{t,y), ^y{t,y) = \: 

and where 

c= \Jta'^t{to,yo) /{2TT^y^y{tQ ,yo)). 

Moreover, if'^{t,y) is a rational expression not constant in y, that has a series-expansion with 
nonnegative coefficients, and such that the convergence domain of'^{t,y) is contained in the con- 
vergence domain of ^(t,y), then the coefficients of the generating function b{t) :~'^{t,a{t)) behave 
as 

[f']b{t) - d7"n-3/2, 

where d = c ■ "i!y{tQ, j/q)- 

Remark 1. The Drmota-Lalley-Wood theorem is classically proved (e.g. in |14[ VII. 6] j for poly- 
nomial systems (i.e., for a polynomial). But one easily checks that, more generally, if 
y) is a bivariate series that diverges at all its singularities, then the conclusions remain the 
same. 

From the Drmota-Lahey-Wood theorem we obtain 

Proposition 1. Let p be a fixed positive real value (stickiness parameter). Let gp(t) be the uni- 
variate generating function of general (resp. saturated, resp. G-saturated) homopolymer secondary 
structures, where t marks the length of the sequence and where each structure has weight p'^^^™^^^ 
Then, for any values of the threshold-parameters 9 and r (t = if one considers saturated 
structures), there are computable positive constants c and 7 (depending on t, 6, p, and in which 
setting: general, saturated, or G-saturated) such that 

n5pW~c7"n-3/2. 

Recall that, in each of the three settings (general, saturated, G-saturated), g(t, s) denotes the 
generating function of secondary structures with at least one link, where t marks the number of free 
elements and s marks the number of links. We have seen that, in each of the three settings, there 
are two rational expressions Q{t, s, y) and R{t, s, y) that have nonnegative coefficients (in the series- 
expansion), and there is an adjoint generating function f(t,s) such that f(t,s) = Q{t, s, f(t, s)) 
and g(t, s) = R{t, s, f{t, s)). In addition, the convergence domain of Q{t, s, y) is clearly the same 
as the convergence domain of R(t,s,y); for instance, for G-saturated structures, the convergence 
domain is the set of nonnegative triples (t,s,y) such that t < 1, s < 1, and xy -\- vx^y^ < 1, 
where v = t/(l — t) and x = s'^~^^/{l — s). Note that in all three settings, /(O, 0) = 1 for 6^ = 
and /(0,0) = for 6* > 0. If we set a{t) := f{t,pt^) — le=o (with 9 the threshold parameter) 
and b{t) g{t,pt^), then we are in the conditions of the Drmota-Lalley-Wood theorem, with 
^{t,y) Q{t,pt'^,y -\- le=o) - le=o and 'i'{t,y) := R{t,pt'^,y + le=o). The conditions for $ and 

are readily checked, we show now the aperiodieity of a(t) :— f{t,pt^) (proving that the nth 
coefficient is strictly positive for n large enough). Note that it is enough to consider p = 1 (the 
strict positivity of [t"]f{t) does not depend on p > 0). In each of the three settings (general, 
saturated, G-saturated), a(t) is the enumerative generating function of some explicit class of 
rooted weighted plane trees. For instance, for saturated structures, a(t) counts admissible rooted 
weighted plane trees with all corners of weight at most 9 -\- 1, with at most one positive corner 
per node, and where each node of arity 1 has exactly one positive corner. For i > t, consider 
the weighted rooted plane tree Ti made of one edge e leading to a leaf £, with weight 1 (resp. 0) 
at the corner to the left (resp. right) of the root, with weight « on e and weight 9 on i. And 
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p = l e = 1 r = 


p = 3/8 e = l r = 


General 


1.104366 -71-3/2 .2.618034" 


1.637405 - n-3/2 . 2.041013" 


Saturated 


1.074271 -71-3/2 . 2.354674" 


1.527438 - n-3/2 . 1.705128" 


G-saturated 


1.088582-71-3/2 . 2.436901" 


1.632293 - n-3/2 . 1.826929" 



Table 4. Asymptotic behaviour of the nth coefficient of the generating func- 
tion gp{t) counting secondary structures (general, saturated, or G-saturated) with 
weight p on each link. 



consider the tree T/ defined exactly as Ti except that £ has weight 6 + 1. Note that Ti contributes 
to [t2'+»+3]a(t) and T/ contributes to [t^'+''+^]a{t). Hence [t"]a{t) > for all n > 2r + 6* + 3, 
so a{t) is aperiodic. In exactly the same way, a{t) is aperiodic in the general setting and in the 
G-saturated setting. 

Theorem [3] ensures that there are c > and 7 > such that [f^]g{t,pt^) ^ 07" n"^/^. Actually, 
in the case of general and G-saturated structures, we have 7 > 1 since (according to Theorem [3]) 
there is some yo such that (1/7,2/0) is in the convergence domain of ^{t,y), and since clearly any 
{to, yo) in the convergence domain of $(i, y) satisfies to < 1 (indeed Q{t, s, y) involves the quantity 
l/{l — t), in each of the general and in the G-saturated case). The generating function gp{t) (which 
includes also secondary structures with no link, as opposed to g{t,s)) satisfies gp{t) = g{t,pt'^) + 
t/{l — t) for secondary and for G-saturated structures, and satisfies gp{t) = g{t,pt^)+t + . . . + 
for saturated structures. So the additional term gathering saturated structures with no link has 
negligible asymptotic contribution in all cases. □ 

For p = 1, gp{t) is the enumerative generating function of homopolymer structures. Another 
value of interest is p = 3/8. Indeed, if we want to count RNA secondary structures (each base is 
labelled by a letter in {A, G, C, U}) instead of homopolymers, this corresponds to giving weight 
4 to each free element (because there are 4 possible labels) and giving weight 6 to each pair of 
linked elements (because there are 6 allowed labellings out of 4^ = 16, due to the Watson- Crick 
and wobble pairs). Therefore the corresponding enumerative generating function is g{4t,6t'^). We 
have 

ng{U,6t') = 4"[r]3(f,3tV8) = A^ng./sit). 

In other words, [t"]g3/s is the expected number of RNA secondary structures with the desired 
properties (general, saturated, or G-saturated) on a random sequence of size n (i.e., for a random 
word in {A, G, C, C/}"). 

Table ID shows the asymptotic behaviour of [t"]gp{t) for p = 1 and p = 3/8 in the three settings. 
(The methodology to compute 7 for saturated structures using computer algebra tools is detailed 
in 0.) 

5.2. Limit law for the number of links. Using a theorem of Drmota [T^ (closely related 
to the Drmota-Lallcy-Wood theorem) we show that the number of links in a random secondary 
structure (general, saturated, or G-saturated) of length n is asymptotically a gaussian law with 
Q{n) expectation and &{y^) standard deviation. 
Gonsider an equation of the form 

(14) a{t, u) = $(t, It, a{t, u)), 

where ^{t,u,y) is a rational expression in t, u and y. Such an equation is called admissible if 
^(t,u,y) is nonconstant in u, has a series-expansion (in t, u, y) with nonnegative coefficients, the 
equation y = $(i, 1, y) is admissible (in the sense of Scction [5.ip . and there is a 3 x 3-matrix m[i, j] 
with integer coefficients and nonzero determinant such that [t'"[*'^lu™[''^ly™['''^l]$(<, m, y) > for 
all i e {1,2,3}. 

Theorem 4 (Drmota [lOj). Let y = a{t,u) be a generating function that is the unique solu- 
tion of an admissible equation y = ^(t,u,y). Assume that the generating function b{t,u) = 
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p=l 0=1 T = 


p = 3/8 e = l r = 


General 


0.276393 ■ n + 0.211474 ■ ^ ■ Af 


0.230789 ■ n + 0.218613 -s/n-M 


Saturated 


0.337361 ■ n + 0.132800 ■ y/H ■ M 


0.321153 ■ n + 0.123936 -^-M 


G-saturated 


0.311958 • n + 0.185032 ■ ^ ■ Af 


0.273773 ■ n + 0.211618 • • A/" 



Table 5. Asymptotic behaviour of the number of hnks (M denotes a normal 
gaussian law). 



X^^ee ^'''''''^''■''''^''^(7) "/ weighted combinatorial class Q is given by b{t,u) = '^{t,u,a{t,u)), with 
y) a rational expression with nonnegative coefficients (in the series- expansion) , nonconstant 
in y, and such that the convergence domain o/\I'(t, l,y) is included in the one o/$(i, For 
n > let Qn '■= {7 G Q, \j\ = n}, and define the random variable Xn as x(7), with 7 a random 
structure in Gn under the distribution 

For u > m a neighbourhood o/l, denote by p{u) the radius of convergence of y : t a(t,u), and 
let 

p'(l) p"(l) p'{l) , fp'iiy " 

t^ 



P(i)' p(i) p(i) Vp(i) 

Then p and a are strictly positive and — - — ^ — converges as a random variable to a normal 

<Jy/n 

(gaussian) law. 

Remark 2. Again the theorem was originally proved for polynomial systems, but the arguments of 
the proof hold more generally when 4> is rational. The role of the condition involving the existence 
of a nonsingular 3x3 matrix is to grant the strict positivity of a, as recently proved in . 

Proposition 2. Letp > 0. Forn > 1, let Xn be the number of links in a general (resp. saturated, 
resp. G-saturated) secondary structure of length n taken at random with weight proportional to 
p#(iinks) (ii^jiijQj-jjily at random when p ^1). Then there are computable strictly positive constants 
p and a (depending onp, 0, t, and on which setting: general, saturated, or G-saturated) such that 
[Xn — p ■ n)l y/n converges as a random variable to a normal (gaussian) law. 

In each of the three settings (general, saturated, G-saturated), we have called g{t, s) the enumer- 
ative generating function of secondary structures with at least one link. We have seen that there 
are two rational expressions Q{t, s, y) and R{t, s, y) that have nonnegative coefficients (in the series- 
expansion), and there is an adjoint generating function f{t, s) such that f{t, s) = Q{t, s, f{t, s)) and 
g(t, s) = R{t, s, f(t, s)); and the convergence domain of Q{t, s, y) is the same as the convergence do- 
main of R{t, s, y). Note that the bivariate series g{t,put^) (with variables t and u) is the weighted 
generating function of secondary structures (with at least one link) where t marks the length, u 
marks the number of links, and where each structure has weight It is easily checked 

that, if we set a{t,u) := f{t,put'^) — Iq^q (with 9 the threshold parameter) and b(t) := g{t,put^), 
then we are in the conditions of Theorem 21 with <^(t,u,y) := Q{t,put^ ,y + lg=o) — le=o and 
^(t,M, y) := R{t,put^,y -\- le^o)- Indeed the 3x3 matrix condition is readily checked, and for 
u = 1 we get the equation of Proposition (U where we have already checked that the conditions 
are satisfied. □ 

TableOshows the asymptotic behaviour for some standard parameter values. (The methodology 
to compute p for saturated structures using computer algebra tools is detailed in [S].) The case 
p = 1 corresponds to a homopolymer of length n taken uniformly at random, while the case 
p = 3/8 corresponds to a (uniformly) random secondary structure where the underlying sequence 
is (any word) in {A,G,C,U}". As expected, saturated structures tend to have more links than 
G-saturated structures, which tend to have more links than general structures. 
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Figure 6. Bottom: a secondary structure with dangles (two 5-dangles and three 
3-dangles). Top-left: the secondary structure with the dual plane tree. Top-right: 
each dangle yields a marked edge-side in the dual plane tree (corners at inner 
nodes are simply marked if they have weight 1, are doubly marked if they have 
weight greater than 1). 



6. Inclusion of dangles 

We show here that the counting approach developed so far (based on duality with plane trees, 
generating functions, and substitution operations) can be easily adapted to take the presence of 
so-called dangling bases into account. In the parenthesis representation of the secondary structure 
(see Figure [6l bottom) a dangling base (shortly dangle) is a distinguished free base of two possible 
kinds: a 5-dangle has to be just before of an opening parenthesis, a 3-dangle has to be just after 
of a closing parenthesis. Note that a dangling base that is just before an opening parenthesis and 
just after a closing parenthesis is either a 5-dangle or a 3-dangle but not both. For a structure 
with dangling bases, the underlying secondary structure is the structure where dangling bases are 
considered as usual free bases (i.e., are not distinguished). 

In the dual plane tree, a 5-dangle (resp. a 3-dangle) is indicated by a marked edge-side to 
the left (resp. to the right) of the edge, see Figure [71[b)-(c). To take dangles into account in our 
counting method, we need to distinguish two types of corners in the dual plane tree T: a corner c 
at vertex v is called lateral if c is incident to the edge going to the parent of w in T (when v is not 
the root-vertex) or c is incident to the root (when v is the root-node); note that every inner node 
has two incident lateral corners (one on the left side, one on the right side). The other corners at 
inner nodes in the tree are called extremal^ sec Figure [71(a). Given a corner c of T (at an inner 
node), an edge-side s incident to c is said to depend on c if c is incident to s at the extremity of 
s closest to the root; note that a lateral corner has one depending edge-side while an extremal 
corner has two depending edge-sides, see Figure [7l[a) . 

We now make important observations to determine when the edge-sides depending on a corner 
c can be marked. If c has weight then none of the depending edge-sides can be marked, because 
there is no free base in the sector of c (hence no candidate to become a dangle). If c is lateral 
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A 





(a) 



(b) 



(c) 



(d) 



Figure 7. (a) The first drawing shows a lateral corner, the second drawing shows 
an extremal corner (depending edges are bolder) . (b) A 5-dangle yields a marked 
left-side of edge in the dual tree, (c) A 3-danglc yields a marked right-side of edge 
in the dual tree, (d) Situation of an extremal corner of weight 1, in which case 
the two depending edge-sides can not both be marked. 



and has positive weight (i.e., is a marked corner) then the depending edge-side is allowed to be 
marked. If c is extremal of weight 1 then at most one of the two edge-sides depending on c is 
allowed to be marked (because the unique free base in the sector of c can not be both a 5-dangle 
and a 3-danglc) . If c is extremal of weight at least 2 then the two depending edge-sides are allowed 
to be marked (and are allowed to be both marked). 

Given these observations we can easily include a variable for dangles in our generating function 
expressions (recall we have treated 3 cases: general, saturated, G-saturated) . 
General structures, inclusion of dangles in the results of Section 14.21 Denote by F = 
F{u,vi,V2-, x) the generating function of J- (the one defined in Section l4.2p where u marks the 
number of leaves, vi (rcsp. V2) marks the number of marked corners that are lateral (rcsp. 
extremal), and x marks the number of edges. Since a tree- vertex with k > 1 children has two 
incident corners that are lateral (the fc — 1 other ones are extremal) , we get the following equation 
(which specifies F uniquely): 

F = u+{2vi + vl)xF + J2^''i'^ + ^ifi^ + V2)''~'^F'' 

k>2 

, x{l + v,)^F 



1-X{1+V2)F 



Similarly, denoting by G = G(u, vi,V2,x) the generating function of Q (where the variables have 
the same meaning as for F), we have 

x{l + Vi)'F 



1~X{1 + V2)F' 

Let g{t, s, r) be the generating function counting secondary structures with at least one link, where 
t marks the number of free elements (including dangles), s marks the number of edges, and r marks 
the number of dangles. Then g{t, s, r) = G{U, Vi, V2, X), where 

1-i ' l-t ' 1-s 

For p > 0,9 > 0, let gp^q[t) be the weighted generating function of secondary structures where 
each structure has weight p#(iinks)g#(dangics)^ r^Yien gp,q{t) = g{t,pt^,q) + t/{l - t). For instance, 
for = 1 and r = we find 

ffp,g(i) t + + (1 +p)t^ + (1 + 3p + 2vq)t'^ + (1 + 6p + + 6pq+pq^)t^ + ■■■ . 



Saturated structures, inclusion of dangles in the results of Section 14.31 The equation 
for F obtained in Section 14.31 becomes (when splitting v into two variables vi , V2 respectively for 
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lateral and extremal marked corners): 

F = u + 2vixF + ^{l + 2vi + {k-l)v2)x''F 

k>2 

x^F'^ + 2xF ■ (vi - V2) V2 

— II -\~ -I- - 

1-xF (1 - xFy 

and the expression of G becomes 

G = ^(l + 2i>i + (fc-l)w2)x'=F'= 



fe>i 

xF-{l + 2{vi - V2)) V2 



l-xF {l-xF)^ 



V2- 



A structure with dangles is called saturated if the underlying secondary structure is saturated. Let 
g{t,s,r) be the generating function counting saturated structures with at least one link, where t 
marks the number of free elements (including dangles), s marks the number of edges, and r marks 
the number of dangles. Then g{t, s, r) = G{U, Vi, V2, X), where 

U = t'{l+t), V, = ^—^{l + r), V2 = ^—^{l+rr-tr^ X ^ . 

\ — t 1 — t 1 — s 

For p > 0, g > 0, let gp^q{t) be the weighted generating function of saturated structures where each 
structure has weight p#(iinks)^#(dangies)^ r^.^^^ ^^^^^^^ ^ g{t,pt^, q)+t + ---+ 1^+^. For = 1 and 
T = we find 

gp,q{t) =t + t'^+ pt^ + {3p + 2pq)t^ + (4]3 + p2 + 4pg)t5 + {2p + Qp^ + 2pq + 'ip^q)t^ + • • • . 

G-saturated structures, inclusion of dangles in the results of Section 14.41 Let Cfc(ui, V2) 

be the polynomial generating function for independent sets of the cycle (1, . . . , fc), where vi (resp. 
V2) marks the number of elements of the independent set that belong to {1, fc} (resp. to {2, . . . , fc — 
1}). Let Sk{v) be the polynomial generating function for independent sets of the chain 1, . . . , fc, 
where v marks the number of elements in the independent set. Recall that S{v, z) := X]fc>o ^k{'v)z*' 
is given by 

at \ 1 + 

1 — z — vz'^ 

Then one easily sees that for fc > 3, 

Ck{vi,V2) = 2viSk-3{v2) + Sk-2iv2), 

and the equation for F obtained in Section [4.41 becomes (when splitting v into two variables vi, V2 
respectively for lateral and extremal marked corners): 



F = u + 2vixF + Y,Ck+i{vuV2)x''F'' 

k>2 

= u + 2vixF + ^(2wi5fc_2(v2) + Sk-i{v2))x''F'' 

k>2 

= u + 2vixF + 2vix^F'^ ■ S{v2, xF) + xF ■ {S{v2,xF) - 1), 
which yields the simplified equation 

F^u + 2.,xF^l±^!^l^f^S^±^-xF-l. 

1 — X-t ~ V2X'^b ^ 

And the expression of G becomes at first 

G^Y. {Sk-l{v2) + 2viSk-2{v2) + vlSk-3{v2))x''F\ 
k>l 
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p=l q=l 0=1 r=0 


p = 3/8 q = l 9 = 1 T = 


General 


0.966912 • . 3.079596" 


1.324839 - n-3/2 . 2.421346" 


Saturated 


1.161018 -71-3/2 .2.637053" 


1.661309 - n-3/2 . 1.923212" 


G-saturated 


1.075299 -71-3/2 .2.747414" 


1.545238 - n-3/2 . 2.068940" 



Table 6. Asymptotic behaviour of the nth coefficient of the generating function 
9p.i{t) counting secondary structures (general, saturated, or G-saturated) with 
dangles, with weight p on each link. 





p = l q = l e = 1 T = 


p = 3/8 q = l e = l T = 


General 


0.262126 ■ n + 0.185467 -^-M 


0.228159 ■ n + 0.186545 ■ ^/n ■ Af 


Saturated 


0.328673 ■ n + 0.120696 -^-N 


0.315303 ■ n + 0.112692 ■ y/E ■ M 


(j-saturated 


0.303683 ■ n + 0.166877 -^-N 


0.273631 ■ n + 0.184741 ■ ^ ■ J\f 



Table 7. Asymptotic behaviour of the number of links {JV denotes a normal 
gaussian law) for secondary structures (general, saturated, or G-saturated) with 
dangles, with weight p on each link. 



with the conventions S-i{v) = 1, S-2{v) = 0. Hence we have 

G = xF ■ {l + vixFfS{v2,xF) + 2vixF + vlx^F'^ 
xF ■ {1 + 2vi + xF ■ {v2 + vf)) 
l-xF- V2X^F'^ ■ 

A structure with dangles is called G-saturated if the underlying secondary structure is G-saturated. 
Let g{t,s,r) be the generating function counting G-saturated structures with at least one link, 
where t marks the number of free elements (including dangles), s marks the number of edges, and 
r marks the number of dangles. Then g(t, s, r) = G{U, Vi,V2, X), where 

U = t^il + t), V. = '-^, V2 = '-^^±^-tr\ X 



l-t l-t 1-s 

For p > 0, g > 0, let gp^q{t) be the weighted generating function of G-saturated structures where 
each structure has weight p#(iinks)q#(dangios)^ rj.^^^ ^^^^^^-^ ^ g{t,pt'^,q) +t/{l - t). For 6 = 1 and 
T = we find 

9p,q = t + t^ + {l+p)t^ + {l + 3p + 2pq)t^ + {l + 4:p+p^ + 'ipq)t'^ + {l + 4:p + 6p^ + 4pq + 4:p^q)t'^ + --- . 

Asymptotic results. Propositions [1] and [2] directly extend to any weight q> for dangles (the 
case without dangles is g = 0). We give the numeric values corresponding to g = 1 (asymptotic 
enumeration of structures with dangles) in Tables [6] and [71 which are the counterparts of Tables |4] 
and [5] 

7. Discussion 

In this paper, we presented various context free grammars that generate the set of secondary 
structures, according to different energy models: Nussinov energy, base stacking energy. Turner 
energvF^ Turner with dangles (where dangles are rigorously treated by the method of Markham 
and Zuker [551 US): Turner (with external dangles), as well as saturated and G-saturated struc- 
tures. Using DSV, dominant singularity analysis and the Flajolet-Odlyzko theorem, we proved 
that the asymptotic number of secondary structures with annotated dangles, as computed in the 



^^Exact base stacking parameters are ignored as is entropy; however, the context-free grammar allows the 
separate marking of distinct features, such as stacked base pairs, hairpins, bulges, internal loops, multiloops. 
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partition function of the Markham-Zukcr software UNAFOLD [35], is 0.63998 • . 3.06039", expo- 
nentially larger than the number of all secondary structures 1.104366 -n"^/^- 2.618034" , previously 
established by Stein and Waterman [41]. This result provides a partial explanation for M. Zuker's 
observation (personal communication) that UNAFOLD requires substantially more computation time 
when dangles are included. 

Since the Nussinov energy model and the base stacking energy model superficially appear to 
be almost equivalent, we presented computational results that display their marked differences Q 
In particular, the base stacking energy model leads to more cooperative folding and a higher 
melting temperature for homopolymers than does the Nussinov energy model. Since appears 
intuitively obvious that energy landscapes with many saturated structures (i.e. having a rugged 
energy landscape in the terminology of KaufFman and Levin |23j). we applied a Monte Carlo kinetic 
folding algorithm to determine the correlation between the number of saturated structures and 
the mean first passage time (MFPT) . It now appears to us that a stronger correlation is likely to 
occur if we take a weighted sum of saturated structures, given by X]fc=o ^T*^' '^ti^re denotes the 
set of all saturated secondary structures having k base pairs, where Z^, = X^ses^ exp{— E{S) / RT) 
and Sfc and Z = J2k=i ^k- In the future, we may apply our software RNAsat |45] to compute Zk 
for each k, in order to determine the correlation between the weighted sum of saturated structures 
and the MFPT. 

Finally, in the main part of the paper, we give generating functions for the number of sec- 
ondary structures and locally optimal secondary structures, with respect to the Nussinov model 
and the base stacking energy models, permitting the determination of the asymptotic number of 
(all resp. saturated resp. G-saturated) structures and the expected number of their base pairs, 
optionally requiring a minimum stem length and stickiness parameter. With stickiness parameter 
2(PGC + PAG + PAu) = |j we obtain combinatorial results for RNA sequences using a reasonable 
theoretical model. The principal advantage of our uniform treatment, using duality, substitution 
of generating functions and the Drmota-Lalley- Woods theorem is that with little additional effort, 
we can determine the asymptotic number of (all resp. saturated resp. G-saturated) structures 
with external dangles, and their expected number of base pairs. Such computations would have 
been more difficult using grammars, DSV and singularity analysis. 
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